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Abstract 

Neutral landscapes and mutational robustness are believed to be im- 
portant enablers of evolvability in biology. We apply these concepts to 
software, defining mutational robustness to be the fraction of random mu- 
tations that leave a program's behavior unchanged. Test cases are used 
to measure program behavior and mutation operators are taken from ge- 
netic programming. Although software is often viewed as brittle, with 
small changes leading to catastrophic changes in behavior, our results 
show surprising robustness in the face of random software mutations. 

The paper describes empirical studies of the mutational robustness 
of 22 programs, including 14 production software projects, the Siemens 
benchmarks, and 4 specially constructed programs. We find that over 
30% of random mutations are neutral with respect to their test suite. 
The results hold across all classes of programs, for mutations at both the 
source code and assembly instruction levels, across various programming 
languages, and are only weakly related to test suite coverage. We con- 
clude that mutational robustness is an inherent property of software, and 
that neutral variants (i.e., those that pass the test suite) often fulfill the 
program's original purpose or specification. 

Based on these results, we conjecture that neutral mutations can be 
leveraged as a mechanism for generating software diversity. We demon- 
strate this idea by generating a population of neutral program variants 
and showing that the variants automatically repair unknown bugs with 
high probability. Neutral landscapes also provide a partial explanation for 
recent results that use evolutionary computation to automatically repair 
software bugs. 



1 Introduction 

The ability of biological organisms to maintain functionality across a wide range 
of environments and to adapt to new environments is unmatched by engineered 
systems. Understanding the intertwined mechanisms and evolutionary drivers 
that have led to the robustness and evolvability of biological systems is an 
important subficld of evolutionary biology. In this paper we focus on neutral 
landscapes and mutational robustness, applying these concepts to software. 
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Modern software arose through fifty years of continued use, appropriation, 
and refinement by software developers. The tools, design patterns and codes 
that we have today are those that proved useful and were robust to software 
developer's edits, hacks and accidents, and those that survived the economic 
pressures of the marketplace. We hypothesize that these evolutionary pressures 
have caused software to acquire mutational robustness resembling that of natural 
systems, where mutational robustness is believed to be intimately related to the 
capacity for unsupervised evolution and adaptation to new environments. We 
posit that software mutational robustness points to the potential for similarly 
powerful methods of unsupervised software enhancement and evolution. 

Robustness is important in software engineering, especially as it relates to 
reliability, availability or dependability. Here we focus on genetic robustness, 
defining software mutational robustness in terms of changes to computer code. 
In this context, we define a neutral mutation to be random change applied to 
a program representation (source code, abstract syntax tree, assembly, binary, 
etc.) such that the mutated program's behavior is unchanged on its regression 
test suite0 Thus, software fitness is assessed by the program's performance on 
a set of test cases that specify correct input/output pairs. We present a formal 
definition of software mutational robustness as the fraction of a set of first order 
mutation operations that can be applied to a software artifact without changing 
its behavior on a set of tests (see Section [XT]) . 

Software mutational robustness measures the fraction of software mutants 
that are neutral. Neutral variants are equivalent to the original program with 
respect to the test suite. They may or may not be semantically equivalent 
(compute the same function) to the original program, they may or may not 
have the same nonsemantic properties (run-time, memory consumption, etc.), 
and they may or may not satisfy the specification (required behavior) of the 
original designers. Empirically, we find that the program's test suite is an 
acceptable proxy for the program specification. We find many neutral variants 
that are both semantically distinct from the original program and still satisfy 
the original program's specification or intended behavior. 

This result can be understood more easily when one considers that there are 
an infinite number of ways to encode any algorithm in software. For example, 
consider this fragment of a recursive quick-sort implementation; 

if (right > left) { 
// code elided . . . 
qui ck (left, r); 
qui ck ( 1 , right ) ; 

} 

Swapping the order of the last two statements to 

quick (1 , right ) ; 
qui ck (left, r); 



^This definition is not to be confused with the "equivalent mutants" of mutation testing, 
see Section l2.3.1l 
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changes the run-time behavior of the program without changing the output, 
giving an alternate implementation of the specification. We find that these kinds 
of neutral mutations are prevalent in software and contribute to evolvability, as 
discussed in Section 

Our mutation operators (delete, copy, and swap) are described in detail in 
Section |31 They are notable because they do not create new code de novo. 
Delete and copy are both plausible analogs of genetic operations on DNA, and 
all three are edit operations that are routinely performed by programmers. They 
are related to operators commonly used in the genetic programming community, 
although we note that we don't have an explicit terminal set. In effect, our 
terminal set corresponds to all of the statements contained in the program being 
studied. 

We are interested in the extent to which mutational robustness enables soft- 
ware evolvability by which we mean the use of automated methods for software 
development and maintenance. In particular, there is increasing interest in 
automatic program repair, and many of the more promising approaches rely 
on unsound program transformations (Section 12.4. ip and may involve source- 
level edits (e.g., [HI [23l |60l [62]) or modifying certain values (program state) 
at run-time (e.g., [42] ) . Mutational robustness may help explain why program 
transformations, such as swapping two statements [62| or clamping an integer 
value [35], can produce acceptable program behavior. 

The primary contributions of the paper include: 

1. The empirical measurement of software mutational robustness in a large 
collection of off-the-shelf software, demonstrating that mutational robust- 
ness is prevalent. We find largely uniform mutational robustness scores 
with an average value of 36.8% and a minimum across all software in- 
stances of 21.2%. We evaluate this claim using 22 programs involving 
over 150,000 lines of code and 23,151 tests. 

2. An application of software mutational robustness to the problem of proac- 
tively repairing software defects. As an illustration, we seeded bugs into 
11 programs, generated populations of neutral variants using mutation, 
and studied the behavior of the population on the bugs with held-out test 
cases. In 8 of the programs, the neutral population contained at least one 
variant that "repaired" the latent bug and passed the held-out test case. 

3. An extension of the software engineering technique known as mutation 
testing to include neutral mutations and software robustness. We discuss 
the relation between software functionality, software test suites and spec- 
ifications, demonstrating the value of neutral mutations, both for proac- 
tively repairing unknown bugs and as an enabler of automatic software 
repair. 

In the remainder of the paper, we first review relevant related work in Bi- 
ology and Software Engineering (Section [5]). We then describe our software 
representations and mutation operators in Section |3J The diverse benchmark 
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suite of software instances used are described in Section Wl] and the software's 
existing test suites (both high and low quahty) are used unaftered. The exper- 
imental design and experimental results are given in Section 21 We present an 
example application of software mutational robustness in Section [S] Finally, we 
analyze our results and discuss their implications in Sections [B] and [T] 

2 Background 

The previous work most closely related to software mutational robustness in- 
cludes work on neutral theories in biology, investigations of the effect of neutral- 
ity in fitness landscapes in evolutionary computation and the field of mutation 
testing in software engineering. In the following three subsections we highlight 
some of the most relevant aspects of these fields. 

2.1 Biology 

Biological evolution is understood in terms of the interplay between genotype 
and phenotype. The genotype is the informational representation that specifies 
the organism, and the phenotype is the physical appearance and behavior of 
organisms interacting with their environment. There is a corresponding type 
of robustness for each of these levels of description: mutational robustness and 
environmental robustness respectively [28]. Mutational robustness is the or- 
ganism's ability to maintain phenotypic traits in the face of internal genetic 
mutations, and environmental robustness is the ability to maintain functional- 
ity across a wide range of environments [55] . 

The two types of robustness are closely related. Many of the causes of 
mutational robustness are also causes of environmental robustness [34j . It is 
thought that the pervasive mutational robustness observed in biological systems 
may have arisen as a by-product of evolutionary pressure for environmental 
robustness [36] . However mutational robustness has been shown to be beneficial 
in its own right, especially in its impact upon an organism's evolvability [9l |4T]. 

Over time, populations of organisms accumulate mutations in their genome. 
Of the many mutations that occur in a single individual, only a tiny fraction 
spread to fixation in the population. Mutations accumulate at a fairly constant 
rate known as the genetic clock [66J. Initially, only those mutations which 
increased fitness were thought to become fixed in the population, an idea known 
as selectionism. In 1968 Kimura suggested that because populations have finite 
size, the majority of accumulated mutations might be effectively neutral, with 
no impact upon fitness |27j . Kimura notes that as a consequence of neutral 
mutation, "we must recognize the great importance of random genetic drift 
due to finite population number in forming the genetic structure of biological 
populations." 

Recent work [14] estimates that roughly 50% of the fixed mutations provide 
a selective advantage in Drosophila fruit files which have effective population 
sizes on the order of 10^, while in hominids, with effective populations sizes of 
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10^, close to 0% of fixed mutations are selective. In this study, roughly 16% 
of non-equivalent mutations in Drosophila were found to be effectively neutral 
compared to roughly 30% of non-equivalent mutations in hominids. Although 
direct comparison is certainly not warranted here, it is notable that these num- 
bers are of the same order of magnitude as our results for software. 

The variants of an organism produced through neutral mutations are called 
"neutral neighbors." Connected sets of neutral neighbors are called "neutral 
spaces" [55] and can occupy sizeable regions of an organism's fitness landscape 
[IS]. Mutational robustness and the resulting neutral spaces in fitness land- 
scapes are believed to contribute to an population's ability to evolve [H]. 

Mutational drift through neutral spaces gives populations access to new 
phenotypes located along the mutational border of the neutral space. Neutral 
spaces thus allow populations to increase diversity and to accumulate new ge- 
netic material. This accumulation of genetic material has been shown to be 
required for large evolutionary innovations [HI EH [ST] |3Z] • 

In effect, mutational robustness of an organism is a metric of the organ- 
ism's fitness landscape. A number of metrics of fitness landscapes have been 
devised in an attempt to statistically characterize landscapes [19] and to di- 
rectly measure those properties of a landscape which encourage evolution [49] . 
Our proposed metric for software begins the work of applying such metrics to 
real- world software. 

2.2 Evolutionary Computation 

The role of neutrality in evolutionary computation has been explored in specific 
contexts by earlier work. In simple linear GP systems whose fitness landscapes 
have similarities to those of RNA, neutral mutations have been shown to enhance 
exploration in evolutionary searches [J. In the evolution of digital circuits, 
neutrality was shown to be beneficial, both in retrospective analysis of successful 
experiments [TH] , and in directed experiments using synthetic fitness landscapes 
designed with variable amounts of neutrality [54]. Some GP methods such as 
Cartesian genetic programming [38] have been explicitly designed to leverage 
neutrality in genetic search. 

Using a population genetics model, varying levels of mutational robustness 
have been shown to either inhibit or encourage evolution, depending on pop- 
ulation size, mutation rate, and fitness landscape |13| . Some studies (e.g., us- 
ing a GA to optimize robot movement [50]) suggest that with certain complex 
genotype-phenotype mappings, periods of neutral evolution do not measurably 
increase a population's evolvability. Although none of this prior work studies 
neutrality in software per se, it does suggest that success in evolving software 
(e.g., repairing bugs) may be related to neutral landscapes in the space of pro- 
gram representations. 
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2.3 Software Engineering 

A number of subfields of software engineering, including mutation testing, n- 
version programming, and program transformation are particularly relevant to 
this work. 

2.3.1 Mutation Testing 

The software engineering community has studied randomly generated program 
mutants for over 30 years under the mantle of "Mutation Testing" ; however, the 
interpretation and use of program mutants has been limited to the discovery of 
inadequate test suites. In their landmark review of mutation testing Jia and 
Harmon describe the field as follows j22| . 

Mutation Testing is a fault-based testing technique which pro- 
vides a testing criterion called the "mutation adequacy score." The 
mutation adequacy score can be used to measure the effectiveness of 
a test set in terms of its ability to detect [mutants] faults. 

In mutation testing mutants are assumed to be faulty. Thus, mutants that 
pass a program's test suite are taken as test suite failures and lower the test 
suite's "mutation adequacy score," a type of coverage metric. Mutation test- 
ing recognizes the possibility of "equivalent mutants" which are semantically 
identical to the original program (cf. the Equivalent Mutant Problem jS^) and 
viewed as problematic. The mutation testing literature, however, does not ac- 
knowledge the existence of the neutral mutants that are semantically different 
from the original program but still satisfy the program's specification (and its 
test suite). 

Figure [1] shows the syntactic space surrounding a program. This is similar 
to a fitness landscape. The five possible types of program mutants are shown. 
In the following sections, we study the class of neutral mutants that have pre- 
viously been ignored by the software engineering community. Our biological 
interpretation of neutral mutants, and ideas for how neutral mutations can be 
leveraged, separate this research from the field of mutation testing despite ap- 
parent similarities in methodology and technique. 

2.4 N- Version Programming 

There has been considerable research on the use of automated diversity in se- 
curity, for example, the special issue of IEEE Computer Security devoted to IT 
Monocultures 1 . Common mechanisms for introducing diversity include Ad- 
dress Space Randomization [6l [15] and Instruction Set Randomization [U [25] , 
among many others. In these applications, diversity is introduced to reduce the 
risk of widely replicated attacks, by forcing the attacker to redesign the attack 
each time it is applied [64 . Our proposed use of diversity, outlined in Section [5] 
is closer in spirit to n-variant systems [11] , where multiple variants of a program 
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Program Syntactic Space 




Figure 1: Syntactic Space of a Program. The set of programs satisfying the program 
specification are shown in blue (left), the set of programs passing the program's test 
suite are shown in red (right), and the set of equivalent programs are shown in green 
(center). Three classes of mutants are shown and labeled. 

are run in parallel, giving each variant identical inputs and checking that they 
all behave similarly before forwarding the output to the user. 

Our proposed application also resembles n-version programming |35j where 
multiple independent, or quasi- independent, manually written implementations 
of critical programs reduce the risk of implementation errors going undetected. 
Our approach differs from n-version programming: Instead of relying on teams 
of human programmers to generate full implementations independently, we use 
lightweight mutation operators to automatically generate variants that are se- 
mantically similar to the original. This addresses the cost issue identified in 
earlier studies j29j and potentially addresses the assumption that independent 
teams of humans are likely to generate programs that will fail independently; 
studies suggest that this assumption does not hold in practice [JT]. Because 
we generate variants automatically, there is a better chance of achieving inde- 
pendence among the variations, either with the mutation operators we describe 
here or with others to be developed in the future. 

2.4.1 Unsound Program Transformation 

Traditionally, automated program transformation techniques (e.g., compiler op- 
timizations) refrain from altering the semantics (or behavior) of the original 
program. Such program transformations are "sound" because they are guar- 
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anteed to preserve program semantics. Recent work has experimented with 
"unsound" program transformations, which do not guarantee to preserve the 
exact semantics of the original program. 

Examples of unsound program transformations include failure oblivious com- 
puting j43], in which common memory errors, such as out-of-bounds reads and 
writes, are either ignored or automatically re-mapped from invalid memory ad- 
dresses to arbitrary valid memory address, allowing the program to continue 
executing. Such trade-offs of robustness for correctness are desirable for appli- 
cations such as web servers where availability is paramount. 

Loop perforation [39] is a run-time method that scarifices exact program 
accuracy in favor of reduced running time or energy consumption. Loop perfo- 
ration eliminates some of the computation specified in the original program by 
dropping some iterations of selected program loops. 

Another important class of unsound program transformations are automated 
repair techniques. These techniques share a common approach: defining a no- 
tion of correct and incorrect program behavior (e.g., from test cases [BU H^ . 
implicit specifications [521, explicit specifications [SO]); generating a set of 
candidate repair transformations (e.g., at random [62], by constraint solving, or 
from an established set [231111]), and validating the candidates produced by the 
transformations until a suitable repair is found. 

3 Technical Approach 

In this section, we define software mutational robustness, describe the program 
representations used in our experiments, and for each representation specify the 
representation-specific mutation operations. 

3.1 Software Mutational Robustness 

We formalize software mutational robustness with respect to a software program 
P (a member of the set of all software programs V), a set of mutation operators 
M (where each m e M is a function mapping V V), and a test suite T : V ^ 
{true, false}. A program P is said to pass the test suite iff T{P) — true. 

Given a program P, a set of mutation operators M, and a test suite T 
such that T{P) = true, we define the software mutational robustness, written 
MutRB{P,T, M), to be the fraction of all first-order mutants of P which pass 
T: 



^ ' \{P' I 3m e M. P' ^ m{P)}\ 

Based on this definition, software mutational robustness depends on three 
parameters, P, T and M . Perhaps surprisingly, the empirical results of Section 
m show that software robustness does not depend strongly on P or T. Our 
mutation operators M , described below, are adapted from genetic programming 
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and are simple and natural analogs of biological mutation. We believe that 
they are also general and appropriate to software. For example, earlier work 
has shown that the set M, together with crossover, is sufficiently strong to 
generate successful repairs for a wide variety of defects in a wide variety of 
software [BU [321 131] • Additionally, they reflect common human edit operations. 

3.2 Representation and Operators 

We consider two levels of program representation: abstract syntax trees (AST) 
based on high-level source, and low- level assembly code (ASM). We use the 
CiL toolkit 130] to parse and manipulate ASTs of C source code. CiL simpli- 
fies some C constructs to facilitate manipulation by computational tools and 
supports source to source translations such as our mutation operations. The 
ASM representation is the linear sequence of instructions taken directly from 
the compiled . s assembly code file produced by standard compilation (i.e., "gcc 
-02 -S") on a 64-bit Intel platform, which is split on line breaks [45] , but with 
directives and other pseudo-operations protected from mutation. Our choice of 
one tree-based and one linear representation increases our confidence that the 
results do not depend on such representation details. For example, our AST 
representation is at the statement level. That is, each node in the AST cor- 
responds to a legal statement in C. This relatively coarse representation level 
provides a distinct contrast to the fine-grained ASM representation. 

Given a source code or assembly language program, we consider three simple 
language-independent mutation operators: copj/, delete and swap. Copy copies 
an AST statement-level subtree or assembly instruction and inserts it at a ran- 
domly chosen child position in a randomly chosen statement-level subtree or 
immediately after a randomly chosen instruction. Delete removes a randomly 
chosen statement-level AST subtree or assembly instruction. Swap exchanges 
two randomly chosen statement-level AST subtrees or assembly instructions. 
Figure [D illustrates these operators. Because AST mutations manipulate sub- 
trees, a large amount of code might be inserted or deleted by a single mutation, 
depending on how high in the tree the mutation is applied. In the experiments, 
mutations modify only AST statements or ASM instructions that are actually 
visited by the test suite. This restriction is similar to mutating only those parts 
of genome that are known to be involved in the phenotype being assayed. Muta- 
tions to untested statements would likely be neutral under our metric, unfairly 
biasing the results towards overly high estimates of mutational robustness. 

4 Experimental Results 

We report results for five experiments on the mutational robustness of programs 
in both representations (AST and ASM). We seek to establish: (1) the level of 
mutational robustness in software; (2) the extent to which mutational robustness 
depends on or can be explained by test suite quality; (3) how mutations can be 
neutral by developing a taxonomy of successful neutral mutations, (4) the effect 
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(a) Copy AST 



(b) Delete AST 



(c) Swap AST 



movq 8C'/,rdx) , '/,rdi 
xorl '/,eax, °/,eax 

movq -S0(7rbp) , 7.rdx 
addl $1, '/,rl4d 
call atoi 

movq -80(y.rbp) , "/.rdx 
movl '/.eax, C/.rlS) 
addq $4, '/,rl8 



movq 8{7,rdx) , '/.rdi 
xorl '/,eax , '/,eax 

movq -SOr/.rbp) , Xrdx 
^ addl $1, •/.rl4d 
call atoi 

movq °/,rdx, -SOC'/.rbp) 
movq -80(7,rbp), '/.rdx 
movl '/.eax, ("/.rlS) 
addq $4, 7.rl8 



(d) Copy ASM 



movq 8('/,rdx), '/,rdi 
xorl 7,eax, 7.eax 
movq '/.rdx, -80('/,rbp) 
addl $1, •/.rl4d 
call atoi 

movq -80(7,rbp), 7,rdx 
movl 7.eax, C7.rl5) 
addq $4, 7.rl5 



movq 8C'/.rdx) , 7.rdi 
xorl 7,eax, 7.eax 
addl $1, 7.rl4d 
* call atoi 
movq 7,rdx, -80C7irbp) 
movl '/.eax, C7.rl5) 
addq $4, 7.rl5 



(e) Delete ASM 



movq SC'/.rdx), '/,rdi 
xorl '/.eax, 7,eax 

movq y.rdx, -80(7.rbpJ 
addl $1, 7,rl4d 
call atoi 

movq -BOC'/.rbp), '/.rdx 
movl '/.eax, (7.rl8) 
addq $4, 7.rl8 



movq 8C7irdx), 7irdi 
xorl 7.eax, 7,eax 
movq -80C7irbp) , 7irdx 

^addl $1, 7.rl4d 

call atoi 

movq 7.rdx, -S0(7.rbp) 
movl 7.eax, (y.rlS) 
addq $4, 7.rl8 



(f) Swap ASM 



Figure 2: Mutation operators: Copy, Delete, Swap. 



of cumulative neutral mutations (i.e., higher-order) mutations, and (5) the effect 
of multiple programming languages and paradigms on mutational robustness. 



4.1 Benchmark Programs 

We selected 22 programs for our experiments (Table [I]). Fourteen are off-the- 
shelf programs selected to measure mutational robustness in real-world soft- 
ware. Four are taken from the Siemens Software-artifact Infrastructure Repos- 
itory, created by Siemens Research |20, and later modified by Rotliermel and 
Harold [44] until each "executable statement, edge, and definition-use pair in 
the base program or its control fiow graph was exercised by at least 30 tests" . 
The space test suite, which was generated by Vokolos [55^ and later enhanced 
by Graves [T7], covers every edge in the control fiow graph with at least 30 
tests. These programs are included for comparability to previous research and 
to study the robustness of programs with extremely high quality test suites. We 
include four simple sorting algorithms taken from |http : //rosettacode . org to 



demonstrate the robustness of programs with full statement, branch-level and 
assembly instruction test coverage. 

Each program has an associated test suite. The tests either came with the 
program as part of its established regression test (e.g., Siemens, potion, redis. 
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Lines 




Test 




Mutational 




Program 


of Code 


Suite 




Robustness 






ASM 


C 


# Test % Stmt. 


AST ASM 


Sorting Algorithms 


bubble- sort 


184 


34 


10 


100 


27.3 


25.7 


insertion-sort 


170 


29 


10 


100 


29.4 


26.0 


merge- sort 


233 


38 


10 


100 


29.8 


21.2 


quick-sort 


219 


38 


10 


100 


28.9 


25.5 


Siemens '20 f 


printtokens 


2419 


536 


4130 


81.7 


21.2 


25.8 


schedule 


922 


412 


2650 


94.4 


34.4 


29.1 


space 


18098 


9126 


13494 


91.1 


37.7 


32.1 


teas 


544 


173 


1608 


96.2 


33.5 


25.9 


Systems Programs 


bzip2 1.0.2 


18756 


7000 


6 


35.9 


33.0 


26.1 


— (alt. test suite) 






22 


71.0 


46.4 


23.6 


ccrypt 1.2 


15261 


4249 


6 


29.5 


33.0 


69.7 


— (alt. test suite) 






16 


40.4 


34.6 


69.7 


grep 


28776 


10929 


119 


24.9 


50.0 


36.7 


imagemagick 6.5.2 


6128 


147 


145 


0.8 


33.3 


66.3 


jansson 1.3 


6830 


2975 


30 


28.8 


33.3 


28.0 


leukocyte 


40226 


7970 


5 


45.4 


33.3 


39.9 


lighttpd 1.4.15 


34165 


3829 


11 


40.1 


61.5 


56.9 


nuUhttpd 0.5.0 


5951 


5575 


6 


64.5 


41.5 


37.8 


oggenc 1.0.1 


299959 


59094 


10 


38.4 


33.4 


22.1 


— (alt. test suite) 






40 


58.8 


40.5 


72.3 


potion 40b5f03 


80406 


15033 


204 


48.4 


33.3 


48.9 


redis 1.3.4 


44802 


17203 


234 


9.2 


33.3 


34.0 


sed 


17026 


8059 


360 


42.0 


33.0 


25.6 


tiff 3.8.2 


22458 


1732 


10 


15.4 


33.3 


90.4 


vyquon 335426d 


20567 


4390 


5 


50.6 


33.3 


69.0 


total or average 


664100 158571 


23151 


40.9 


33.9 ±10 39.6 


±22 



Table 1: Benchmark programs with mutational robustness of first-order (one-step) 
mutations. "Lines of Code" columns report the size of the program in terms of lines 
of C source code and lines of compiled assembly code. The "Test Suite" columns show 
the size of the test suite both in terms of number of test cases and the percentage 
of all AST level statements in the program that are exercised by the test suite. The 
"Mut. Robustness" columns report the percentage of all first-order mutations that 
were neutral. The ± values in the bottom row indicate one standard deviation. For 
each program, at both the AST and ASM level we generated at least 200 unique 
variants using each of the three mutation operations {copy, delete and swap). Mutation 
operations were applied at locations chosen randomly from all those visited by the 
test cases. For three programs (bzip, ccrypt and oggenc) we also evaluated on three 
independent alternate test suites. 

t Although the Siemens benchmark suite claims complete branch and statement coverage, we 
find less than 100% statement coverage. This is due to our use of finer-grained Cil statements 
in calculating coverage. See Section 13.21 for discussion of the Cil program representation. 
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jansson) or were constructed manually (e.g., sorting algorithms, webservers). 
A number of our benchmarks implement invertible transformations (e.g., com- 
pression, encryption, serialization, image manipulation) , which form an implicit 
formal specification and permit simple testing |59| . Three of the programs 
(bzip, ccrypt and oggenc) are thus each evaluated on two independently con- 
structed test suites. For lighttpd and imagemagick, we restricted mutations 
to mod_f astcgi . c and convert . c respectively, demonstrating application to 
modules as well as to whole systems. 

4.2 Software Mutational Robustness 

We first demonstrate that a variety of software programs exhibit significant 
mutational robustness under the mutation operators described in Section [3l In 
this experiment, we measure the percentage of random mutations to code visited 
by at least one test case which leave the program's behavior unchanged on all 
of its test cases. In every case, the initial program passes all of its test cases. 
For each mutation operator we generate program variations making at least 200 
copies of the original program and then applying a single random mutation to 
each copy. We refer to these as first-order, or one-step, mutations. We then run 
the mutated program on its test suite and count it as neutral if it passes all of 
its tests. 

Table [T] shows the results of this experiment on the benchmark programs. 
We wish to rule out trivial mutations, such as the insertion of dead code (e.g., 
statements that appear after a return) or the transposition of independent lines, 
that are visible in the source code but would produce equivalent assembly code. 
Since program equivalence is undecidable [8 , we approximate this by compiling 
the AST using "gcc -02 -S," which includes dead code elimination, SSA form, 
and instruction scheduling. Multiple source code variants that produce the same 
optimized assembly code (modulo label names and other directives) are counted 
only once in Table [T] Similarly, any two ASM-level mutations which produce 
the same executable are counted only once. 

Although the results vary by program (e.g., grep is more robust than printtokens), 
the results show a remarkably high level of mutational robustness: Across all 
programs, operators, and representations (source or assembly) 36.8% of vari- 
ants continue to pass all test cases with no systematic difference between the 
AST and ASM representation. In the next two subsections we ask to what 
extent these results arise from inadequate test suites (j4.3[) or from semantically 
equivalent mutations (|4.4|) . 

4.3 Does Mutational Robustness Depend on Test Suite 
Quality? 

The results in Table [T] are striking, but they could potentially be dismissed as 
an artifact of inadequate test suites, although we explicitly limit our mutation 
operators to code that is visited by the test suite. Figure [3] plots the data 
from Table [T] to show the relation between mutational robustness and test suite 
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coverage. Across programs and test suites, mutational robustness cannot be 
entirely explained by test suite coverage. Notably, even programs with 100% 
test suite coverage still display at least 20% mutational robustness. Due to 
potential overlaps between test cases, we report results for test suite coverage 
rather than test suite size. 
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Figure 3: Mutational Robustness by Test Suite coverage (replots data from Table I). 
The X-axis marks the percent of the program which is run by the test suite and the Y- 
axis shows the calculated mutational robustness. Visual inspection shows no significant 
dependence of mutational robustness on test suite coverage and that programs with 
100% test coverage still exhibit at least 20% mutational robustness. 



One might expect mutational robustness to be inversely related to test suite 
coverage. This is not true in either limit. At one extreme, even high-quality 
test suites (such as the Siemens benchmarks, which were explicitly designed 
to test all execution paths) and test suites with full statement, branch and 
assembly instruction coverage have at least 20% mutational robustness. We 
count all compilation errors as non-neutral. At the other extreme, a minimal 
test suite that we designed for bubble sort, which does not check program output 
but requires only successful compilation and execution without crash, has only 
84.8% mutational robustness. These results show that, in practice, for both real 
programs and comprehensively tested ones, mutational robustness is not fully 
explained by test suite inadequacy. 

The quick-sort example in Section [1] demonstrates that simple mutations 
can yield different fully correct implementations. Yin et al. provide another 
example in which a fully-formally verified implementation of AES encryption 
is changed based on what was "purely an implementation decision, [where] the 
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specification did not impose any restrictions" , yielding another formally verified 
implementation |651 p. 61]. Additionally, estimates from the mutation testing 
community posit that between 5% and 15% of mutants are fully semantically 
equivalent to the original program [4 7) . These examples demonstrate that even 
with perfect validation, significant mutational robustness can still exist. This 
supports our claim that mutational robustness is inherent in software, rather 
than a function of inadequate validation, and it is at least partially explained 
by observing that for any functional program specification, there are infinitely 
many ways to implement it. 

4.4 Taxonomy of Neutral Variants 

To provide insight into our results, we chose bubble sort as an example of an 
easy-to-understand and easy-to-test program. We then studied the effect of 35 
first-order neutral mutations on bubble sort at the AST level, developing a tax- 
onomy of the neutral mutations. After manual review, all 35 neutral variations 
were confirmed to be valid implementations of the sorting specification. We next 
categorized them with respect to their operational differences from the original. 
The results are shown in Tabled] 



# 


Functional Category 


Frcqucncy/35 


1 


Different whitespace in output 


12 


2 


Inconsequential state change 


10 


3 


Extra or redundant computation 


6 


4 


Equivalent or redundant conditional guard 


3 


5 


Switched to non-explicit return 


2 


6 


Changed code is unreachable 


1 


7 


Removed optimization 


1 



Table 2: Functional taxonomy of 35 neutral, first order, AST variants of bubble sort. 
Categorized by manual review. "DifTerent whitespace in output" describes variants 
whose output differs from the original program only in whitespace characters which 
are not detected by the test suite. "Inconsequential state change" describes variants 
whose mutations change the behavior of the program while executing but do not 
change the tested output of the program, e.g., mutations which change the values of 
variables in memory which do not later affect program output. 

These categories have different effects on program execution. Only categories 

1 and 5 affect the externally observable behavior of the program by changing 
output and return values in ways not specified by the program specification. 
Categories 2,3,4 and 7 may affect the running time of the program. Category 

2 includes the removal of unnecessary variable assignments, re-ordering non- 
interacting instructions and changing state which is later overwritten or never 
again read. Many changes, such as types 2, 3 and 4, produce programs that will 
likely be more robust to further manipulation by inserting redundant (occasion- 
ally diverse) control fiow guards (i.e., conditionals that control if statements) 
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and variable assignments. Across most of these categories we find alternative 
implementations that are not semantically equivalent to the original but which 
do still conform to the program specification. 

4.5 Cumulative Robustness 
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Figure 4: Random walk in neutral landscape of Insertion Sort at ASM level. In Panel 



4(a) the size of neutral variants is not controlled while in Panel 4(b) only variants 



which are less than or equal to the length of the original program (in ASM LOG) are 
counted as neutral. On the 32-bit machine used for this experiment insertion-sort 
compiles to 175 assembly LOG. Both the average size and the mutational robustness 
of mutant variants are shown on the Y-axis, the X-axis shows the cumulative number 
of mutation operators applied. 



The previous experiments measured the percentage of first-order mutations 
which were neutral. This subsection explores the effects of accumulating cumu- 
lative neutral mutations in small assembly programs. We begin with a working 
assembly code implementation of insertion sort and generate 100 neutral vari- 
ants using the procedure described earlier. From these 100 neutral variants, we 
then generate a population of 100 second order neutral variants. This is done 
by looping through the population of first-order mutants, randomly mutating 
each individual, and retaining the result if it is neutral discarding it otherwise. 
Once 100 neutral second order variants have been accumulated they become the 
population of second order variants, and are used to generate a population of 
third order variants in the same manner. This process is repeated to generate 
neutral populations separated from the original program by successively more 
neutral mutations. Figure |4] shows the result of repeating this process for 250 
steps to produce a final population of 100 neutral variants, each 250 neutral 
mutations away from the original program. 

The results show that under this procedure mutational robustness increases 
with the mutational distance away from the original program. This is not sur- 
prising given that at each step mutationally robust variants are more likely to 
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C 


C++ 


Haskell 


OCaml 


Avg. 


Std.Dev. 


bubble 


25.7 


28.2 


27.6 


16.7 


24.6 


5.3 


insertion 


26.0 


42.0 


35.6 


23.7 


31.8 


8.5 


merge 


21.2 


46.0 


24.9 


22.7 


28.7 


11.6 


quick 


25.5 


42.0 


26.3 


11.4 


26.3 


12.5 


Avg. 


24.6 


39.5 


28.6 


18.6 


27.9 




Std.Dev. 


2.3 


7.8 


4.8 


5.7 


3.1 





Table 3: Mutational robustness of sorting algorithms at the assembly instruction level 
with 100% test suite coverage, for different algorithms and source language. 



contribute to the subsequent population. We conjecture that this result corre- 
sponds to the population drifting away from the perimeter of the programs neu- 
tral space (plateau). Similar behavior has been described for biological systems, 
in which populations in a constant environment experience a weak evolutionary 
pressure for increased mutational robustness [53 Ch. 16]. 

The average size of the program also increases with mutational distance from 
the original program Figure 4(a)[ suggesting that the program might be achiev- 
ing robustness by adding "bloat" — useless instructions. To control for bloat. 
Figure |4(b)| shows the results of an experiment which counts only neutral in- 
dividuals that are the same size or smaller (measured in number of assembly 
instructions) than the original. With this additional criterion, mutational ro- 
bustness continues to increase but the size periodically dips and rebounds. The 
dips are likely consolidation events, where additional instructions are discov- 
ered that can be eliminated. This shows that not only are the neutral spaces 
surrounding any given program implementation very large, but they are easily 
traversable through iterative mutation. Despite controlling for bloat. Figure 
|4(b)| shows a small increase in population- wide mutational robustness, however 
further experimentation is required to determine if this effect is significant. 



4.6 Multiple Languages 

The source language and compilation process impose important regularities on 
the assembly representations of programs. In this section we investigate how 
mutational robustness varies across assembly code derived from different lan- 
guages. We evaluated the mutational robustness of sorting algorithms com- 
piled from five languages that span three programming paradigms (imperative, 
object-oriented, and functional). We focused on sorting algorithms because they 
are small and sufficiently well-understood to test exhaustively: Test suites were 
hand-crafted to cover all executable assembly instructions, branches, and corner 
cases. 

The results shown in Table [3] demonstrate that mutational robustness exists 
across multiple programming languages and paradigms. It is striking that even 
for a small, exhaustively tested sorting program, 28% of the one-step mutations 
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did not change the program's functionahty. 

This experiment addresses the question of whether our results depend on id- 
iosyncrasies of a particular language implementation or programming paradigm, 
and they support the claim that mutational robustness occurs at a significant 
rate in a wide variety of software. 

5 Application 

The previous sections demonstrated the prevalence of mutational robustness in 
software, independent of language, algorithm, or test suite coverage. Section lTSl 
suggests that mutations can be accumulated in the same program without loss 
of functionality. We are interested in how this robustness might be leveraged 
in practical applications. This section gives one example of how we might use 
mutational robustness to proactively fix unknown bugs in a program, i.e., bugs 
that are not discovered by the test suite. 

The basic idea is to generate nontrivial deep changes to an algorithm or 
implementation within the neutral landscape, retaining populations of diverse 
software variants. Some variants may be immune to undiscovered bugs, auto- 
matically pinpointing fixes to new bugs as they arise. 

The term artificial diversity describes a wide variety of techniques for au- 
tomatically randomizing non-functional properties of programs with the goal 
of disrupting widely replicated attacks. Many methods have been proposed, 
including stack frame layouts, instruction set numberings, or address space lay- 
outs Although the idea of diversifying certain aspects of a program's 
implementation has been previously proposed [1], neutral mutations provide a 
much more general and practical approach. Here we extend earlier work in this 
area by generating program variants that are distinct algorithmically but neu- 
tral with respect to the test suite. These distinct variants can be used in an 
n- variant system in which a diverse population of programs are run con- 
currently in parallel on the same inputs. When the program variants contain 
algorithmic or implementation changes (rather than simple remappings, e.g., 
address space layout randomization or instruction set randomization), the ap- 
proach is known as implementation diversity [10]. Such a system may be used 
to flag potential bugs when there are discrepancies in observed behavior, and 
may be more robust to bugs which only affect subsets of the population. 

Returning to the earlier discussion of mutation testing, we note that the 
practice of mutation testing is limited because of the significant effort required 
to analyze mutants that pass the test suite. Such mutants must be manually 
classified, either as fully equivalent to the original program or non-equivalent, 
and the latter further classified as buggy or as superior to the original program 
(cf. human oracle problem [63] ) . Our proposed adaptation avoids these labor 
intensive steps by retaining a population of all such neutral variants. When a 
bug is encountered in the original program, it will be detected if some members 
of the population fail to express the bug, and the non-failing variations can 
be analyzed to suggest repairs. This approach of deferring analysis until a 
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potentially beneficial variation is found may be more feasible than traditional 
mutation testing, because it does not require exhaustive manual review. 

5.1 Repairing Bugs 

We first demonstrate that it is possible to construct variants in the neutral fitness 
landscape that can fix unknown bugs while retaining required functionality. To 
do this, we seeded each of the programs in Table |4] with five random defects 
following an established defect distribution (THl p. 5] and fault taxonomy [3D] 
(e.g., "missing conditional clause", "extra statement", "constant should have 
been variable", "wrong parameter"). The defects were seeded in advance and 
without regard to the mutation operators. For each defect we produced a held- 
out test to verify its presence or absence. 

For each program, we next generated 5,000 one-step neutral variants using 
the mutation operators defined earlier. We also noted which of these also passed 
any of the held-out test cases for the seeded bugs. In practice, 5,000 neutral 
variants proved sufficient to generate at least one bug-fixing variant for most 
programs, and if such a variant was generated at all, it was within the first 5,000 
neutral variants. We tested this by searching up to 20,000 variants, which did 
not improve performance. Only variants that passed the original test suite were 
retained; the mutation process did not have access to the held-out test cases for 
the seeded bugs. 

Table m shows the results. We say that a variant fixes an unknown bug if it 
passed all original test cases and the held-out test case associated with that bug. 
In practice we found fixes which exactly revert (repair) the original bug as well 
as compensatory mutations which do not touch the bug itself but fix or avoid 
the bug by changing other parts of the program. Specifically, we found that 3% 
of the proactive fixes changed the same line of code in which the original bug 
was seeded and 12% of fixes affect code within 5 lines of the seeded bug. The 
remaining 88% of fixes could be considered compensatory in that they do not 
affect the bug directly but rather change other portions of the program so that 
the bug is not expressed. 

These proactive repairs can be used to pinpoint the bug, because the diff 
between them and the original program can locate either the bug itself or rele- 
vant portions of the program code. Previous work demonstrated that software 
engineers take less time to address defect reports that are accompanied by such 
machine-generated patch-like information [61], which provides evidence that 
these proactive repairs would be useful in practice. 

We observe some common trends when examining the bugs fixed in table 
131 The bugs fixed most easily were those that naturally mirror the mutation 
strategies employed by our technique. For example, we found multiple exam- 
ples of the repairs that deleted problematic statements or clauses, corrected an 
incorrect value for a constant, changed a relational operator (for instance < 
to <), inserted clauses or statements to test for extra conditions, changed a 
parameter value in a function call, etc. However, there was significant overlap 
between the types of bugs that were proactively repaired and those that were 
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Table 4: Five unique bugs were seeded in each subject program according to a defect 
distribution taken from the Firefox open source project. Five thousand neutral variants 
were created for each program, without regard to the seeded bugs. Each variant passed 
all visible tests. The "Unique Bugs Fixed" column counts the number of seeded bugs 
fixed by at least one variant. The "Bug Fixes" column counts the number of variants 
that fixed at least one bug. 

not, suggesting that additional tuning might aid in the repair of these currently 
unfixed classes of bugs. One bug that was never repaired in our experiment 
involved an "incorrect function call." To repair this bug using our technique 
would require finding the "correct" function elsewhere in the program with ex- 
actly the correct parameters, while avoiding adding extraneous mutations that 
changes behavior on regression test cases. In this experiment, we focused on 
discovering the proactive repair, and we leave for future work the problem of 
automatically deciding how to resolve discrepancies that are discovered among 
selected variants, either with an automated repair or by generating additional 
test cases. 

We predict that the more latent bugs there are in a program, the more likely 
it is that at least one of them will be repaired through proactive diversity. This 
is relevant because most deployed programs have significantly more than five 
outstanding defects (e.g., 18,165 and 2,013 open bug tickets for Eclipse (V3.0) 
and Firefox (VI. 0) respectively [2]). To test this prediction, we seeded one of 
our test programs, potion, with ten additional held-out defects; Figure [5] plots 
the number of distinct bugs fixed in 5,000 neutral variants as a function of the 
number of defects seeded, yielding a correlation of 95%. If the linear relation 
shown in Figure [5] applies to the Eclipse and Firefox projects, a population of 
5,000 program variants may proactively fix roughly 9,000 and 1,000 latent bugs 
in those systems, respectively. Although preliminary, these results show how 
the mutational robustness properties of programs might be used, for example, 
as a practical advantage in the scenario outlined in the next subsection. 
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Figure 5: Given a population of 5,000 program variants we show the total number 
of unique bugs fixed (blue) and the number of diverse variants (method of variant 
selection not shown) required before the first fix is found (red). Both are shown by 
the number of bugs seeded in the subject program (potion). The Pearson correlation 
coefficient between the number bugs seeded and fixed is 0.95. 



In their classic work on independence in multiversion programming, Knight 
and Leveson found that distinct teams of humans, when given an identical 
programming task, produce solutions that have correlated errors (bugs) |31j . 
That is, two independent teams are unlikely to produce two independent im- 
plementations, thus decreasing the potential benefit of N-person programming. 
Our approach could potentially address part of the non-independence hurdle 
because our random mutations are generated by a pseudo-random number gen- 
erator rather than by people, thus producing independent algorithmic changes 
within in the mutational space. For example, if we seed 15 bugs in potion and 
then select at random ten of the proactive repairs, we see that, on average, 1.7 
different defects (rather than 1.0, if they were 100% correlated) are fixed by 
those ten. 

We leave as future work the application of this technique to potentially higher 
order neutral mutants as constructed in Section [4.5l While such variants would 
greatly increase diversity they would be less useful in directly pinpointing the 
source code implicated in buggy behavior. 

To put this somewhat unconventional idea into perspective, for the first 
month after Firefox 4.0 was released (March 11 through April 10, 2011), the 
project's Bugzilla database shows that an average of 77 new, non-duplicate 
bugs were reported per day. Given that this rate of bug reporting is more than 
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ten times the number of bugs seeded in our evaluation, we conjecture that our 
technique would be at least as effective in similar real-world systems as it is in 
our experimental setup. To summarize, our results are promising because they 
suggest that a small number of neutral variants run in parallel can pinpoint a 
relatively large number of bugs in practice. 

6 Discussion 

There are several interesting aspects of the results described in this paper, many 
of which point the way to future work: 

1. The view of software presented here explicitly challenges the prevailing 
folk wisdom that software is a precise and intentionally engineered mech- 
anism brittle to small perturbations. We find software to be inherently ro- 
bust to random mutations, malleable within extensive neutral landscapes, 
and evolvable by combining the mutation operators described here with 
crossover and selection fSH [331 ISl] ■ 

2. Currently, very little may be learned through a quantitative comparison of 
the 36.8% mutational robustness found in software to typical levels of mu- 
tational robustness in biological systems (e.g, 30% mutational robustness 
in hominids [M], or the almost 40% neutrality of gene knockouts (dele- 
tions) in yeast [51]). There are many drivers of mutational robustness in 
biological systems; environmental stability influences levels of robustness 
[57] , the centrality of a gene may infiuence its mutational robustness [23] , 
additionally evolution may either select for mutational robustness [53] , or 
be more effective in organisms which are mutationally robust [9]. There 
are analogs to each of these factors in software systems which future work 
may relate to the levels mutational robustness found in software. 

3. The quantitative results that we report certainly depend on the particular 
choice of mutation operators, and a competent programmer could likely 
craft operators capable of achieving any pre-set desired level of mutational 
robustness. However, we believe that the operators used in this paper 
are sufficiently simple, powerful and general that they expose software 
mutational robustness as an inherent property of software. A topic left 
for future investigation is to define and test a wider variety of mutation 
operators, studying their effect on software mutational robustness. 

4. As mentioned earlier, the large neutral landscapes revealed by our study 
may explain the success of evolutionary methods, such as genetic program- 
ming, on the task of automated program repair [62) . Although the par- 
allel with evolutionary biology is intriguing, we do not yet have definitive 
evidence that quantifies the role of mutational robustness in automated 
program repair, a topic that we leave for future work. We suspect that 
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other methods of automated repair (e.g., [HI [521 ISOl IS21 112]) may ulti- 
mately be understood in the context of software mutational robustness 
and evolvability. 

5. It is well known that there are an infinite number of possible implementa- 
tions for any given program specification, and Table [1] can be interpreted 
as illustrating how easy it is to find such multiple equivalent implementa- 
tions. This results also suggest that programs have a significant amount 
of unidentified and unexploited redundancy. One practical implication 
of large traversable neutral landscapes would be searching the plateaus 
for regions that minimize resource requirements such as program size (as 
demonstrated in Figure |4(b)[ ), memory requirements or runtime. This 
may explain the success of previous work optimizing graphics shader soft- 
ware [48] . In addition to the proactive diversity application described in 
Section[5l it might be feasible to incorporate other machine learning meth- 
ods into software development and maintenance processes; such methods 
typically work poorly on brittle system and require some degree of robust- 
ness while they search for improved solutions. 

6. The relation between our methodology and that of mutation testing opens 
the possibility of re-purposing the many tools developed by the mutation 
testing community. For example, runtime optimization techniques such as 
mutant scheme generation [52) enable the compilation of "super mutants" 
capable of executing all first-order mutants of a program. Such a system 
could potentially be applied to efficiently deploy and run populations of 
diverse software variants as described in Section [S] on end user systems. 
This is just one example of how the extensive toolbox of mutation test- 
ing could be applied to new applications leveraging our view of software 
mutants as beneficial rather than harmful. 

An exciting topic for future research, opened up by the results presented 
here, is an investigation of the root causes of software robustness, for example, 
the idea of short error propagation distances observed in web servers studied in 
failure oblivious computing [33] or the feedback loops and statistical operations 
used in the PARSEC benchmark suite |7] , which may become more common as 
future applications leverage multi-core architectures. 

Beyond the significance for software engineering, we believe that our results 
will be of interest to biologists. We considered the effect of repeated neutral 
mutations in a single program, showing that robustness can increase system- 
atically through population exploration of the neutral landscape; a property 
shared with biological systems [55] • We also showed that it is possible to gener- 
ate programs that are many mutational steps removed from the original while 
retaining functionality (i.e., without leaving the neutral plateau). These large 
extended neutral landscapes are thought to be essential to the ability of bio- 
logical systems to improve through natural selection [46l [53] . Unlike biological 
systems, which can be notoriously difficult to study, our software analogs may 
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eventually provide a useful experimental framework for testing out hypotheses 
about the role of neutrality in biological evolution. 

Mutational robustness has many correlates in biological systems. These 
include a correlation between environmental and mutational robustness [34[ 1281 
[55] and a correlation between mutational robustness and evolvability [SH] . The 
presence of analogous correlations in software systems is an intriguing possibility 
that could be investigated empirically. Such an investigation could indicate 
whether these relations are general across complex mutationally robust systems 
or are specific to biological systems. If such correlations do exist in software, they 
could lead to new applications, such as methods of automated hardening which 
automatically increase environmental robustness through increasing mutational 
robustness (as in Section |43)) . 

7 Conclusion 

The previous sections described experimental results, using three simple mu- 
tation operators, which show that software is surprisingly robust to random 
mutations. For the programs we tested, 37% of the mutations had no effect 
on software functionality, as measured by the programs test suites. Software 
mutational robustness, or neutrality, is observed even in programs that are is 
completely correct according to their specifications. Just as neutrality is believed 
to enhance evolvability in naturally evolving populations, so may software neu- 
trality enable and explain the evolvability of software, either through automated 
means (e.g., [52]) or by humans. 

Software robustness is potentially useful for enhancing the resilience of soft- 
ware systems. We demonstrate this idea by describing a method that increases 
software diversity, automatically generating software variants that are immune 
to as yet undiscovered bugs. The insights into software described here suggest 
several opportunities for the Software Engineering community, including the 
following: Creating system diversity, for example, to protect against security 
exploits; incorporating machine learning methods into software development 
and maintenance; improving program performance; or developing error-tolerant 
computations. 

We postulate that the presence of mutational robustness in software is not 
an effect of intentional design, but is rather an effect of software's provenance 
through natural selection — even though the agents of selection, mutation and 
reproduction are human engineers. In this way, mutational robustness can be 
viewed as a property arising through inadvertent selection in both natural and 
engineered systems. Further study of software as an evolved system may yield 
new insights into those aspects of evolution that are specific to biological systems 
and those which are general across other complex evolved, and even engineered, 
systems. Because software is fundamentally easier to instrument and observe 
than naturally occurring populations, studying software robustness may lead 
in the future to an increased understanding of the role of neutrality in natural 
evolution. 
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